From: Andras Kornai <andras@calera.com>
Subject: Re: One more kermit question
To: fdc@watsun.cc.columbia.edu (Frank da Cruz)
Date: Thu, 11 Mar 93 21:33:45 PST
----------------------------------------------------------------------
CYRILLIC ENCODING FAQ Version 1.3, March 13 1993
ACKNOWLEDGEMENTS
Most of the information was provided by the following:
David J. Birnbaum <djbpitt+@pitt.edu>
Frank da Cruz <fdc@watsun.cc.columbia.edu>
Bur Davis <bdavis@adobe.com>
George Fowler <gfowler@ucs.indiana.edu>
Richard B. Paine <RPAINE@CCNODE.Colorado.EDU>
Slava Paperno <PAPY@CORNELLA.cit.cornell.edu>
Keld J. Simonsen <Keld.Simonsen@dkuug.dk>
Glenn E. Thobe <thobe@getunx.info.com>
Dimitri Vulis <DLV@CUNYVMS1.BITNET>
Johan W. van Wingen <precal@rulmvs.leidenuniv.nl>
Thanks to all who contributed -- I am responsible for the errors that
still remain.
Andras Kornai (andras@calera.com, kornai@csli.stanford.edu)
Q: What are the commonly used computer encodings for Cyrillic?
A: Broadly speaking, there are three kinds of schemes in use: those that
replace Cyrillic characters by 7-bit ascii values, those that use the
full 8-bit range 0-255, and those using multi-byte codes. Presently
only the first two types are in wide use, but for reference purposes I
will also discuss the third type.
Q: What kind of transliteration schemes are there?
A: The most important one is called KOI-7: the Russian alphabet is given
by the ASCII characters (note the exchange of upper and lower cases):
UPPER CASE: abwgde$vzijklmnoprstufhc~{}"yx|`q
lower case: ABWGDE#VZIJKLMNOPRSTUFHC^[]_YX\@Q
The following extensions to the official standard KOI-7 are supported in
Glenn Thobe's conversion programs for invertibility: '"'=YER, '#'=yo,
'$'=YO, '<'=left guillemet, '>'=right guillemet.
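As a minimal sketch of how the two strings above can drive a conversion, the following C function maps a single KOI-7 character to a Unicode code point. The names koi7_to_unicode and from_position are hypothetical, and Unicode as the target is an assumption made for illustration. The entries for '$', '#' and '"' rely on the extensions just mentioned, and anything that is not a KOI-7 letter is returned unchanged.

```c
#include <string.h>

/* The Russian alphabet (yo in 7th place) in KOI-7, as listed above. */
static const char koi7_upper[] = "abwgde$vzijklmnoprstufhc~{}\"yx|`q";
static const char koi7_lower[] = "ABWGDE#VZIJKLMNOPRSTUFHC^[]_YX\\@Q";

/* Map alphabet position 0..32 to a code point, given the code point
   of the first letter and that of yo, which sits outside the main run. */
static long from_position(int pos, long first, long yo)
{
    if (pos == 6) return yo;
    return first + (pos < 6 ? pos : pos - 1);
}

/* Hypothetical helper: KOI-7 character to Unicode code point. */
long koi7_to_unicode(int c)
{
    const char *p;

    if (c <= 0 || c > 127) return c;        /* not a 7-bit character */
    if ((p = strchr(koi7_upper, c)) != NULL)
        return from_position((int)(p - koi7_upper), 0x0410, 0x0401);
    if ((p = strchr(koi7_lower, c)) != NULL)
        return from_position((int)(p - koi7_lower), 0x0430, 0x0451);
    return c;                               /* digits, space, etc. */
}
```

For example, koi7_to_unicode('a') gives 0x0410 (capital A) and koi7_to_unicode('Q') gives 0x044f (small ja).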
A slightly different (multicharacter) scheme is employed by Steve
Gaarder's (gaarder@theory.tc.cornell.edu) conversion code from Old
KOI-8, included below. This particular scheme provides easy
readability but suffers from some transliteration weirdness, such as
mapping short ii and yeri to the same character. Since proper
transliteration often requires context-sensitive rules, and differs
from language to language within the same script, a fuller discussion
is beyond the scope of the present document. For an overview of the
major Cyrillic to Latin transliteration schemes used in the US, see pp
457-460 of the Style Manual of the US Government Printing Office, for
sale by the Superintendent of Documents, USGPO, Washington DC 20402,
Stock Number 021-000-00120-1 (paper) or 021-000-00120-0 (hardbound).
See also the Chicago Manual of Style, and Transliteracija russkikh
slov latinskimi bukvami, GOST 16876-71.
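#include <stdio.h>

/* Transliterations indexed by (code & 0x7f) - 0x40: lower case first,
   then upper case, in Old KOI-8 order. */
char transtbl[64][5] =
{"yu", "a", "b", "ts", "d" , "e", "f", "g", "kh", "i", "y" , "k", "l",
 "m", "n", "o", "p", "ya", "r" , "s", "t", "u", "zh", "v", "'",
 "y", "z", "sh", "e", "shch", "ch", "`",
 "YU", "A", "B", "TS", "D" , "E", "F", "G", "KH", "I", "Y" , "K", "L",
 "M", "N", "O", "P", "YA", "R" , "S", "T", "U", "ZH", "V", "'",
 "Y", "Z", "SH", "E", "SHCH", "CH", "`" };

int main(void)
{
    int c;
    while ((c = getchar()) != EOF)
    {
        if (c >= 0x80) c -= 0x80;   /* strip the high bit, 0x80 included */
        if (c < 0x40) putchar(c);   /* digits, punctuation, controls */
        else printf("%s", transtbl[c - 0x40]);
        /* note: 7-bit letters are treated as KOI-7 and transliterated too */
    }
    return 0;
}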
Q: What are the eight-bit schemes?
A: For the IBM mainframe world, which includes the ES (edinaja sistema)
clones of 360-370 mainframes, the basic scheme, called DKOI-8, extends
EBCDIC by putting the Cyrillic letters in the unused slots, mostly in
the rectangle 0x8a to 0xff (first hex digit >=8, second digit >=a). The
mysteries of EBCDIC/ASCII conversion go beyond the scope of this
document, and in the table that follows I will ignore 8-bit ascii values
below 0xa0 and refer the reader to Dimitri Vulis' excellent document,
which sheds some light on the IBM meaning of the characters 0x80-0x9f
which are reserved in both ISO 8859-1 (Latin-1) and 8859-5 (Cyrillic).
/* From 8859-5 to DKOI-8. ebcdic(isoval) = isotoibm[isoval-160] */
int isotoibm[96] = {
0x41,0xaa,0x4a,0xb1,0x9f,0xb2,0x6a,0xb5,
0xbd,0xb4,0x9a,0x8a,0x5f,0xca,0xaf,0xbc,
0x90,0x8f,0xea,0xfa,0xbe,0xa0,0xb6,0xb3,
0x9d,0xda,0x9b,0x8b,0xb7,0xb8,0xb9,0xab,
0x64,0x65,0x62,0x66,0x63,0x67,0x9e,0x68,
0x74,0x71,0x72,0x73,0x78,0x75,0x76,0x77,
0xac,0x69,0xed,0xee,0xeb,0xef,0xec,0xbf,
0x80,0xfd,0xfe,0xfb,0xfc,0xad,0xae,0x59,
0x44,0x45,0x42,0x46,0x43,0x47,0x9c,0x48,
0x54,0x51,0x52,0x53,0x58,0x55,0x56,0x57,
0x8c,0x49,0xcd,0xce,0xcb,0xcf,0xcc,0xe1,
0x70,0xdd,0xde,0xdb,0xdc,0x8d,0x8e,0xdf
};
There are minor variations to DKOI, called Country Extended Code Page
037 (most common on BITNET), CECP 500 (which is the definitive one), the
"JNET" and the "FORTRAN" mappings. The differences between these are
tabulated below. Notice that EBCDIC/DKOI, unlike ASCII, is not uniquely
defined even on the 0-127 range:
8859-5  037   500   JNET  FORTRAN
0x21    0x5a  0x4f  0x5a  0x4f     exclamation point (bang)
0x5b    0xba  0x4a  0xad  0x4a     opening square bracket
0x5d    0xbb  0x5a  0xbd  0x5a     closing square bracket
0x5e    0xb0  0x5f  0x5f  0x5f     circumflex accent
0x7c    0x4f  0xbb  0x6a  0x4f     logical or (vertical bar)
[a2]    0x4a  0xb0  0x43  0x43     cent sign (in 037)/capital dje (in 500)
[ac]    0x5f  0xba  0x54  0x54     logical not (in 037)/capital kje (in 500)
0xd5    0xef  0xef  0xbb  0xad     small ie
0xe3    0x46  0x46  0x4a  0xbb     small u
0xe5    0x47  0x47  0xfc  0xbd     small kha
0xfc    0xdc  0xdc  0x6a  0xfc     small kje
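One way to use these differences in a converter is to keep them as a small override list that is consulted before the main table. The sketch below does just that. The names (enum variant, struct diff, variant_code) are hypothetical, the bracketed entries [a2] and [ac] are entered as 0xa2 and 0xac, and a return value of -1 simply means "not listed here, fall back to the general table".

```c
#include <stddef.h>

enum variant { CECP037, CECP500, JNET, FORTRAN };

struct diff {
    int iso;        /* value in the 8859-5 column */
    int ebcdic[4];  /* values for 037, 500, JNET, FORTRAN */
};

/* The rows of the table above, in the same order. */
static const struct diff diffs[] = {
    { 0x21, { 0x5a, 0x4f, 0x5a, 0x4f } },  /* exclamation point */
    { 0x5b, { 0xba, 0x4a, 0xad, 0x4a } },  /* opening square bracket */
    { 0x5d, { 0xbb, 0x5a, 0xbd, 0x5a } },  /* closing square bracket */
    { 0x5e, { 0xb0, 0x5f, 0x5f, 0x5f } },  /* circumflex accent */
    { 0x7c, { 0x4f, 0xbb, 0x6a, 0x4f } },  /* vertical bar */
    { 0xa2, { 0x4a, 0xb0, 0x43, 0x43 } },  /* cent sign / capital dje */
    { 0xac, { 0x5f, 0xba, 0x54, 0x54 } },  /* logical not / capital kje */
    { 0xd5, { 0xef, 0xef, 0xbb, 0xad } },  /* small ie */
    { 0xe3, { 0x46, 0x46, 0x4a, 0xbb } },  /* small u */
    { 0xe5, { 0x47, 0x47, 0xfc, 0xbd } },  /* small kha */
    { 0xfc, { 0xdc, 0xdc, 0x6a, 0xfc } }   /* small kje */
};

/* Variant-specific code for iso, or -1 if the table does not list it. */
int variant_code(int iso, enum variant v)
{
    size_t i;

    for (i = 0; i < sizeof diffs / sizeof diffs[0]; i++)
        if (diffs[i].iso == iso)
            return diffs[i].ebcdic[v];
    return -1;
}
```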
For the Internet, the most important code seems to be Old KOI-8, widely used
in the Relcom groups (but probably not a whole lot elsewhere). Old KOI-8
(GOST 19768-74) from 1974 more or less follows Latin transliteration order
and does not include upper-case hard sign, or letters common to other Slavic
Cyrillic alphabets (Bulgarian, Macedonian, Serbian, Ukrainian...). In the
0-127 range it is identical with ascii, and for the 192-254 region see the
transtbl array above. Some software, including uunpack (also used in
Sergej Ryzhkov's bml, aka Beauty Mail system for PCs) which is distributed
by Relcom, forces upper-case hard sign to 255; others (and the standard!)
declare this incorrect, or perhaps reserve 255 for DEL. In an earlier
version of Andrew Hume's <andrew@research.att.com> tcs, which supports
conversion across a wide variety of Cyrillic encodings, this was called the
"mystery DOS Cyrillic encoding", except that his sha and shcha seem to be
interchanged. Tcs is available for anon ftp from research.att.com as
/dist/tcs.shar.Z. The semantics of 128-191 in Old KOI are unclear
to me. If there is an official code page (it was suggested that Xenix users
might have one), please post it.
For the PC community, Code Page 866 seems to be quite important. This is
what Microsoft is using in its russified version of MS-DOS. In 0-31
ascii control chars are replaced by a random selection of dingbats. In
32-126 it is identical to ascii, and in 127 it has something that looks
like a little house (the interpretation of such positions seems to be
subject to much uncertainty). The Russian part (128-255) is identical to
Brjabrin's alternativnyj variant, except for 242-251, where some of the
accents/symbols of AV are replaced by non-Russian Cyrillic characters
and other symbols. Unfortunately, beyond Russian, CP 866 covers only
Ukrainian and Belorussian, with the vague suggestion that e.g. Macedonian users could
redefine the six non-Russian Cyrillic positions. This problem is
largely resolved in Code Page 1251, the Microsoft Cyrillic Windows 3.1
character set, (also endorsed by WordPerfect and Adobe), which contains
all Cyrillic letters used by modern Slavic languages. CP 1251 is fully
compatible with ascii on 0-127 (leaves control positions undefined), has
the Russian alphabet (in order, but without io) in 192-255, and puts the
non-Russian Cyrillic, Russian io, and a few symbols in 128-191.
Brjabrin's Alternativnyj Variant (AV) is also widely used on PCs. It
has Russian in 128 to 175 in alphabetical order except for yo, graphics
characters in 176 to 223, again Russian in 224-241. The same set of
graphics characters, but not in the same order, is used in Brjabrin's
Osnovnoj Variant: they are similar to, but not identical with, IBM
Extended ASCII graphics chars (neither the set of shapes nor the code
values are exactly the same). AV and OV have no non-Russian Cyrillic or
accented characters, but four accent marks are provided: 242 (acute
below the symbol), 243 (grave below the symbol), 244 (acute above the
symbol), and 245 (grave above the symbol). These, as well as upper case
and lower case yo, codes 240 and 241, are in the same position in
Osnovnoj Variant as well. Codes 246 - 249 are arrows, pointing right,
left, down, up, in that order. Codes 250 and 251 are, in both sets
described by Brjabrin, the division sign and the plus/minus sign (the
latter becomes a radical sign in 866). 252 is the Number symbol, 253 is
a sunburst, and 254 is "end of proof". 255 is in principle unused -- in
practice people put things there.
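Because both alphabetic blocks of AV are in alphabetical order, the letters can be converted by arithmetic alone. The following is a small sketch under that reading of the layout. The function name av_letter_to_unicode is hypothetical, Unicode is assumed as the target, and everything that is not a Russian letter (graphics, accents, arrows, symbols) is reported as -1 rather than converted.

```c
/* Hypothetical helper: Unicode code point of an AV letter,
   or -1 if the code is not one of the Russian letters. */
long av_letter_to_unicode(int c)
{
    if (c >= 128 && c <= 175)       /* capital A .. small pe */
        return 0x0410 + (c - 128);
    if (c >= 224 && c <= 239)       /* small er .. small ja */
        return 0x0440 + (c - 224);
    if (c == 240) return 0x0401;    /* capital yo */
    if (c == 241) return 0x0451;    /* small yo */
    return -1;                      /* graphics, accents, symbols */
}
```

For example, code 128 gives 0x0410 and code 239 gives 0x044f, while any code in the graphics block 176-223 gives -1.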
For the academic community, the lack of accents is remedied by the
Academic version of AV developed at Cornell, which includes upper and
lower case acute-accented vowels, and lower case grave-accented vowels.
These replace all but six of the graphics characters (the six that were
retained are those that are necessary for drawing a single-line box).
The accented vowels in this set include a grave-accented lower case yo.
Also included are the letters with diacritics used in French, German,
and Spanish. The complete chart and DOS/Windows software may be
requested from Exceller Software Corp. 800-426-0444. (This is NOT a
product endorsement -- I haven't even seen the stuff!) Cornell also
developed an Academic version of CP1251. In this, non-Russian Slavic
languages are not supported: their letters have been replaced by Russian
accented vowels. These include upper and lower case acute-accented
vowels, and lower case grave-accented vowels. Also included are upper
and lower case grave-accented yo. The AcademicFont Cyrillic character
set was developed by University Microcomputers, who pioneered the use of
Slavic languages on IBM-compatible computers in the US in the
mid-eighties. This set is included among the 11 sets in Exceller's
product. It supports Slavic and some non-Slavic languages, but not
accented vowels.
For the Macintosh community, there is a separate code page. It is ascii
below 128, has the Russian capital letters in 128-159 in alphabetical order
(as usual, io is treated separately) and the Russian lowercase letters in
224-254, but lower case ja is moved to 223, its place (255) taken by the
sunburst symbol. In the 160-222 range we find the same set of (ISO 8859-5)
non-Russian Cyrillic characters as in CP 1251. The symbols that appear here
are also largely the same as in 1251, but the orderings are completely
different and a few symbols are unique to one or the other, e.g. permille
in 1251, capital delta in the Mac encoding. While a Macintosh version
capable of character conversion is still on the drawing boards, for most
other platforms Columbia Kermit is capable of converting between a large
variety of Cyrillic encodings. Anon ftp to watsun.cc.columbia.edu: for
C-Kermit 5A(188) (Unix, VMS, OS/2, Amiga etc) get file kermit/b/ckaaaa.hlp,
read it, take it from there. For MS-DOS Kermit 3.11, get (in binary mode)
kermit/bin/msvibm.zip, then unzip. For IBM Mainframe Kermit 4.2 and later,
get kermit/b/ik0*.* plus one of the following: kermit/b/ikc*.* for VM/CMS,
kermit/b/ikt*.* for MVS, kermit/b/ikx*.* for CICS or kermit/b/ikm*.* for
MUSIC. There is also a large collection of character-set tables under
kermit/charsets.
Finally, the most broadly accepted standard outside these communities seems
to be GOSTSCI (GOSTCII), a term used colloquially to refer to Brjabrin's
Osnovnoj Variant or to ISO 8859-5 (which is also ECMA 113), although these
two are not identical when it comes to non-Russian Cyrillic. The term "New
KOI-8" means the 1987 revision of KOI-8 (GOST 19768-87) -- all these use the
same (alphabetical, except for yo) order as 8859/5, starting with A at 176.
However, the non-Russian Cyrillic characters (160-175 and 240-255 in new
KOI-8) are not part of OV; their space is taken up by some graphics chars
described for AV above. ISO 8859-5 provides for the Cyrillic characters
required for writing all major Slavic Cyrillic alphabets (Belorussian,
Bulgarian, Macedonian, Serbian, Ukrainian...), but not for those alphabets
that were devised for non-Slavic languages in the Soviet Union (Abkhazian,
Bashkir, Chukchee, Khanty, Tajik, ....), or archaic letters.
Q: Is this a big mess or what?
A: To straighten this out, it seems necessary to adopt a fixed point of
reference, which I take to be Unicode V1.1 = ISO 10646-1.2. While in
principle 10646 is a four-byte standard and Unicode uses 16-bit integers,
the "Basic Multilingual Plane" of 10646 is by definition identical to the
values assigned in Unicode 1.1, both being two-byte quantities (called UCS-2
by ISO). The following list gives the essential part of the names of the
Cyrillic characters and the last two hex digits of their Unicode/10646
encoding.
For reasons of space, the official Unicode/10646 names have been
abbreviated. For a full list of names, anon ftp to unicode.org, cd to
pub/MappingTables, and get namesall.lst (which is slightly over 200k). To
get back the full official name from the abbreviations, always add the
prefix CYRILLIC, unless the position is UNUSED. Further, expand CAP (SMA) to
CAPITAL (SMALL). Finally, the word LETTER should be added after CAP/SMA,
unless it is THOUSANDS, LIGATURE, or COMBINING. The numerical code values
given in the second column have also been abbreviated to the last two
digits, since the preceding two hex digits (really signifying "Cyrillic")
are always 04 in Unicode/10646.
The third column gives the two-character mnemonic abbreviations suggested in
Keld Simonsen's RFC1345 where they exist, to facilitate cross-reference to
this document (available by anon ftp e.g. from sunsite.unc.edu as
/pub/doc/rfp/rfp1345.txt.Z) which has tables for Serbian, Macedonian, as
well as other Cyrillic encodings (IBM CP 880, INIS-cyrillic = ISO-IR-51,
ECMA-cyrillic = ISO-IR-111) whose domain of usage is unclear to me, and
whose table for Old KOI seems to be in fact a New KOI table. I will add
conversion tables for these (or for any other) encodings provided a real
user community exists and actually generates some public domain
machine-readable texts.
UNUSED 00
CAP IO 01 IO
CAP DJE 02 D%
CAP GJE 03 G%
CAP E 04 IE
CAP DZE 05 DS
CAP I 06 II
CAP YI 07 YI
CAP JE 08 J%
CAP LJE 09 LJ
CAP NJE 0A NJ
CAP TSHE 0B Ts
CAP KJE 0C KJ
UNUSED 0D
CAP SHORT U 0E V%
CAP DZHE 0F DZ
CAP A 10 A=
CAP BE 11 B=
CAP VE 12 V=
CAP GE 13 G=
CAP DE 14 D=
CAP IE 15 E=
CAP ZHE 16 Z%
CAP ZE 17 Z=
CAP II 18 I=
CAP SHORT II 19 J=
CAP KA 1A K=
CAP EL 1B L=
CAP EM 1C M=
CAP EN 1D N=
CAP O 1E O=
CAP PE 1F P=
CAP ER 20 R=
CAP ES 21 S=
CAP TE 22 T=
CAP U 23 U=
CAP EF 24 F=
CAP KHA 25 H=
CAP TSE 26 C=
CAP CHE 27 C%
CAP SHA 28 S%
CAP SHCHA 29 Sc
CAP HARD SIGN 2A ="
CAP YERI 2B Y=
CAP SOFT SIGN 2C %"
CAP REVERSED E 2D JE
CAP IU 2E JU
CAP IA 2F JA
SMA A 30 a=
SMA BE 31 b=
SMA VE 32 v=
SMA GE 33 g=
SMA DE 34 d=
SMA IE 35 e=
SMA ZHE 36 z%
SMA ZE 37 z=
SMA II 38 i=
SMA SHORT II 39 j=
SMA KA 3A k=
SMA EL 3B l=
SMA EM 3C m=
SMA EN 3D n=
SMA O 3E o=
SMA PE 3F p=
SMA ER 40 r=
SMA ES 41 s=
SMA TE 42 t=
SMA U 43 u=
SMA EF 44 f=
SMA KHA 45 h=
SMA TSE 46 c=
SMA CHE 47 c%
SMA SHA 48 s%
SMA SHCHA 49 sc
SMA HARD SIGN 4A ='
SMA YERI 4B y=
SMA SOFT SIGN 4C %'
SMA REVERSED E 4D je
SMA IU 4E ju
SMA IA 4F ja
UNUSED 50
SMA IO 51 io
SMA DJE 52 d%
SMA GJE 53 g%
SMA E 54 ie
SMA DZE 55 ds
SMA I 56 ii
SMA YI 57 yi
SMA JE 58 j%
SMA LJE 59 lj
SMA NJE 5A nj
SMA TSHE 5B ts
SMA KJE 5C kj
UNUSED 5D
SMA SHORT U 5E v%
SMA DZHE 5F dz
CAP OMEGA 60
SMA OMEGA 61
CAP YAT 62 Y3
SMA YAT 63 y3
CAP IOTIFIED E 64
SMA IOTIFIED E 65
CAP LITTLE YUS 66
SMA LITTLE YUS 67
CAP IOTIFIED LITTLE YUS 68
SMA IOTIFIED LITTLE YUS 69
CAP BIG YUS 6A O3
SMA BIG YUS 6B o3
CAP IOTIFIED BIG YUS 6C
SMA IOTIFIED BIG YUS 6D
CAP KSI 6E
SMA KSI 6F
CAP PSI 70
SMA PSI 71
CAP FITA 72 F3
SMA FITA 73 f3
CAP IZHITSA 74 V3
SMA IZHITSA 75 v3
CAP IZHITSA DOUBLE GRAVE 76
SMA IZHITSA DOUBLE GRAVE 77
CAP UK DIGRAPH 78
SMA UK DIGRAPH 79
CAP ROUND OMEGA 7A
SMA ROUND OMEGA 7B
CAP OMEGA TITLO 7C
SMA OMEGA TITLO 7D
CAP OT 7E
SMA OT 7F
CAP KOPPA 80 C3
SMA KOPPA 81 c3
THOUSANDS SIGN 82
NON-SPACING TITLO 83
NON-SPACING PALATALIZATION 84
NON-SPACING DASIA PNEUMATA 85
NON-SPACING PSILI PNEUMATA 86
UNUSED 87
UNUSED 88
UNUSED 89
UNUSED 8A
UNUSED 8B
UNUSED 8C
UNUSED 8D
UNUSED 8E
UNUSED 8F
CAP GE WITH UPTURN 90 G3
SMA GE WITH UPTURN 91 g3
CAP GE BAR 92
SMA GE BAR 93
CAP GE HOOK 94
SMA GE HOOK 95
CAP ZHE WITH RIGHT DESCENDER 96
SMA ZHE WITH RIGHT DESCENDER 97
CAP ZE CEDILLA 98
SMA ZE CEDILLA 99
CAP KA WITH RIGHT DESCENDER 9A
SMA KA WITH RIGHT DESCENDER 9B
CAP KA VERTICAL BAR 9C
SMA KA VERTICAL BAR 9D
CAP KA BAR 9E
SMA KA BAR 9F
CAP REVERSED GE KA A0
SMA REVERSED GE KA A1
CAP EN WITH RIGHT DESCENDER A2
SMA EN WITH RIGHT DESCENDER A3
CAP EN GE A4
SMA EN GE A5
CAP PE HOOK A6
SMA PE HOOK A7
CAP O HOOK A8
SMA O HOOK A9
CAP ES CEDILLA AA
SMA ES CEDILLA AB
CAP TE WITH RIGHT DESCENDER AC
SMA TE WITH RIGHT DESCENDER AD
CAP STRAIGHT U AE
SMA STRAIGHT U AF
CAP STRAIGHT U BAR B0
SMA STRAIGHT U BAR B1
CAP KHA WITH RIGHT DESCENDER B2
SMA KHA WITH RIGHT DESCENDER B3
CAP TE TSE B4
SMA TE TSE B5
CAP CHE WITH RIGHT DESCENDER B6
SMA CHE WITH RIGHT DESCENDER B7
CAP CHE VERTICAL BAR B8
SMA CHE VERTICAL BAR B9
CAP H BA
SMA H BB
CAP IE HOOK BC
SMA IE HOOK BD
CAP IE HOOK OGONEK BE
SMA IE HOOK OGONEK BF
PALOCHKA C0
CAP SHORT ZHE C1
SMA SHORT ZHE C2
CAP KA HOOK C3
SMA KA HOOK C4
UNUSED C5
UNUSED C6
CAP EN HOOK C7
SMA EN HOOK C8
UNUSED C9
UNUSED CA
CAP CHE WITH LEFT DESCENDER CB
SMA CHE WITH LEFT DESCENDER CC
UNUSED CD
UNUSED CE
UNUSED CF
CAP A WITH BREVE D0
SMA A WITH BREVE D1
CAP A WITH DIAERESIS D2
SMA A WITH DIAERESIS D3
CAP LIGATURE A IE D4
SMA LIGATURE A IE D5
CAP IE WITH BREVE D6
SMA IE WITH BREVE D7
CAP SCHWA D8
SMA SCHWA D9
CAP SCHWA WITH DIAERESIS DA
SMA SCHWA WITH DIAERESIS DB
CAP ZHE WITH DIAERESIS DC
SMA ZHE WITH DIAERESIS DD
CAP ZE WITH DIAERESIS DE
SMA ZE WITH DIAERESIS DF
CAP ABKHASIAN DZE E0
SMA ABKHASIAN DZE E1
CAP I WITH MACRON E2
SMA I WITH MACRON E3
CAP I WITH DIAERESIS E4
SMA I WITH DIAERESIS E5
CAP O WITH DIAERESIS E6
SMA O WITH DIAERESIS E7
CAP BARRED O E8
SMA BARRED O E9
CAP BARRED O WITH DIAERESIS EA
SMA BARRED O WITH DIAERESIS EB
CAP U WITH ACUTE EC
SMA U WITH ACUTE ED
CAP U WITH MACRON EE
SMA U WITH MACRON EF
CAP U WITH DIAERESIS F0
SMA U WITH DIAERESIS F1
CAP U WITH DOUBLE ACUTE F2
SMA U WITH DOUBLE ACUTE F3
CAP CHE WITH DIAERESIS F4
SMA CHE WITH DIAERESIS F5
CAP DJE WITH ACUTE F6
SMA DJE WITH ACUTE F7
CAP YERU WITH DIAERESIS F8
SMA YERU WITH DIAERESIS F9
UNUSED FA
UNUSED FB
UNUSED FC
UNUSED FD
UNUSED FE
UNUSED FF
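The abbreviation rules given before the list are mechanical enough to sketch in code. The fragment below rebuilds a full name from an abbreviated one and the full 16-bit value from the two hex digits. The names expand_name, no_letter and full_code are hypothetical, and the code follows the rules exactly as stated, so it makes no claim about entries the rules do not cover.

```c
#include <stdio.h>
#include <string.h>

/* True if the name continues with a word that takes no LETTER. */
static int no_letter(const char *rest)
{
    return strncmp(rest, "THOUSANDS", 9) == 0 ||
           strncmp(rest, "LIGATURE", 8) == 0 ||
           strncmp(rest, "COMBINING", 9) == 0;
}

/* Expand an abbreviated name into out; returns 0 for UNUSED, else 1. */
int expand_name(const char *abbr, char *out, size_t outsize)
{
    const char *kind = NULL;
    const char *rest = abbr;

    if (strcmp(abbr, "UNUSED") == 0) return 0;
    if (strncmp(abbr, "CAP ", 4) == 0) { kind = "CAPITAL"; rest = abbr + 4; }
    else if (strncmp(abbr, "SMA ", 4) == 0) { kind = "SMALL"; rest = abbr + 4; }

    if (kind == NULL)
        snprintf(out, outsize, "CYRILLIC %s", rest);
    else if (no_letter(rest))
        snprintf(out, outsize, "CYRILLIC %s %s", kind, rest);
    else
        snprintf(out, outsize, "CYRILLIC %s LETTER %s", kind, rest);
    return 1;
}

/* Full value from the two hex digits given in the second column. */
unsigned full_code(unsigned low)
{
    return 0x0400u | (low & 0xffu);
}
```

With these, "CAP IO" and 01 become CYRILLIC CAPITAL LETTER IO at 0x0401.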
Q: Is everything clear now?
A: Probably not. To ease the pain, here follow some tentative conversion
tables *from* the 8-bit schemes described above *to* Unicode. Since the
Unicode/10646 character set is much larger, no tables are provided in
the other direction.
In the 0-127 range everything is ASCII (except for the CP866 dingbats in
the range 0-31 which are at any rate optional, and for EBCDIC/DKOI-8, for
which see above) so here tables are only provided for 128-255. Notice
that values not starting with 0x04 are often given, meaning that
the Unicode equivalent is outside the Unicode Cyrillic range
0x0400-0x04ff, but included at some other place, typically among the
arrows (0x2190-0x21ff) or other semigraphic material (0x2500-0x25ff). If
a particular encoding leaves (by official definition, not necessarily in
practical usage) some code unused, this is designated by "-1" in the
conversion table. For some positions the tables show a "-2", meaning
that I have no information on the intended meaning. (This is not the
same as there being no Unicode codepoint for the character in question,
a situation we potentially encounter with AV and OV 242-245, see note
there.)
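A lookup routine built on these conventions might look like the sketch below. The names to_unicode, fill_demo_table, CODE_UNUSED and CODE_UNKNOWN are hypothetical. The demo table is a toy stand-in built at run time (unknown in 128-191, the Russian alphabet from 192 on), not one of the real tables. The pass-through for 0-127 assumes an ASCII-compatible encoding, so it applies neither to the CP866 dingbats nor to EBCDIC/DKOI-8.

```c
#define CODE_UNUSED  (-1L)   /* officially unused position */
#define CODE_UNKNOWN (-2L)   /* intended meaning not known */

/* Convert one byte using a 128-entry table for the range 128-255.
   Returns a code point, or one of the negative markers above. */
long to_unicode(const long table[128], int byte)
{
    if (byte < 0 || byte > 255) return CODE_UNUSED;
    if (byte < 128) return byte;        /* ASCII passes through */
    return table[byte - 128];
}

/* Toy table for demonstration only, NOT a real encoding table. */
void fill_demo_table(long table[128])
{
    int i;

    for (i = 0; i < 64; i++)
        table[i] = CODE_UNKNOWN;        /* 128-191 */
    for (i = 64; i < 128; i++)
        table[i] = 0x0410 + (i - 64);   /* 192-255: A .. small ja */
}
```

A caller would test the result for a negative value before emitting anything, and could treat -1 and -2 differently, for instance by dropping the former and flagging the latter.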
/* From old Koi-8 to Unicode */
long oldkoi8tou[128] = {
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
0x044e,0x0430,0x0431,0x0446,0x0434,0x0435,0x0444,0x0433,
0x0445,0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,
0x043f,0x044f,0x0440,0x0441,0x0442,0x0443,0x0436,0x0432,
0x044c,0x044b,0x0437,0x0448,0x044d,0x0449,0x0447,0x044a,
0x042e,0x0410,0x0411,0x0426,0x0414,0x0415,0x0424,0x0413,
0x0425,0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,
0x041f,0x042f,0x0420,0x0421,0x0422,0x0423,0x0416,0x0412,
0x042c,0x042b,0x0417,0x0428,0x042d,0x0429,0x0427,0x042a
};
/* From CP866 to Unicode */
long cp866tou[128] = {
0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
0x2591,0x2592,0x2593,0x2502,0x2524,0x2561,0x2562,0x2556,
0x2555,0x2563,0x2551,0x2557,0x255d,0x255c,0x255b,0x2510,
0x2514,0x2534,0x252c,0x251c,0x2500,0x253c,0x255e,0x255f,
0x255a,0x2554,0x2569,0x2566,0x2560,0x2550,0x256c,0x2567,
0x2568,0x2564,0x2565,0x2559,0x2558,0x2552,0x2553,0x256b,
0x256a,0x2518,0x250c,0x2588,0x2584,0x258c,0x2590,0x2580,
0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
0x0401,0x0451,0x0404,0x0454,0x0407,0x0457,0x040e,0x045e,
0x00b0,0x2022,0x00b7,0x221a,0x2116,0x00a4,0x25a0, -1
};
/* From CP1251 to Unicode */
long cp1251tou[128] = {
0x0402,0x0403,0x201a,0x0453,0x201e,0x2026,0x2020,0x2021,
-1,0x2030,0x0409,0x2039,0x040a,0x040c,0x040b,0x040f,
0x0452,0x2018,0x2019,0x201c,0x201d,0x2022,0x2013,0x2014,
-1,0x2122,0x0459,0x203a,0x045a,0x045c,0x045b,0x045f,
0x00a0,0x040e,0x045e,0x0408,0x00a4,0x0490,0x00a6,0x00a7,
0x0401,0x00a9,0x0404,0x00ab,0x00ac,0x00ad,0x00ae,0x0407,
0x00b0,0x00b1,0x0406,0x0456,0x0491,0x00b5,0x00b6,0x00b7,
0x0451,0x2116,0x0454,0x00bb,0x0458,0x0405,0x0455,0x0457,
0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
};
/* From Mac to Unicode */
long mactou[128] = {
0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
0x2020,0x00b0,0x0490,0x00a3,0x00a7,0x2022,0x00b6,0x0406,
0x00ae,0x00a9,0x2122,0x0402,0x0452,0x2260,0x0403,0x0453,
0x221e,0x00b1,0x2264,0x2265,0x0456,0x03bc,0x0491,0x0408,
0x0404,0x0454,0x0407,0x0457,0x0409,0x0459,0x040a,0x045a,
0x0458,0x0405,0x00ac,0x221a,0x0192,0x2248,0x0394,0x00ab,
0x00bb,0x2026,0x00a0,0x040b,0x045b,0x040c,0x045c,0x0455,
0x2013,0x2014,0x201c,0x201d,0x2018,0x2019,0x00f7,0x201e,
0x040e,0x045e,0x040f,0x045f,0x2116,0x0401,0x0451,0x044f,
0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x00a4,
};
/* From Alternativnyj Variant to Unicode */
long avtou[128] = {
0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
0x2591,0x2592,0x2593,0x2502,0x2524,0x2561,0x2562,0x2556,
0x2555,0x2563,0x2551,0x2557,0x255d,0x255c,0x255b,0x2510,
0x2514,0x2534,0x252c,0x251c,0x2500,0x253c,0x255e,0x255f,
0x255a,0x2554,0x2569,0x2566,0x2560,0x2550,0x256c,0x2567,
0x2568,0x2564,0x2565,0x2559,0x2558,0x2552,0x2553,0x256b,
0x256a,0x2518,0x250c,0x2588,0x2584,0x258c,0x2590,0x2580,
0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
0x0401,0x0451,0x0317,0x0316,0x0301,0x0300,0x2192,0x2190,
0x2193,0x2191,0x00f7,0x00b1,0x2116,0x00a4,0x25a0, -1
};
/* The interpretation of the four symbols following the second
alphabetic block in AV remains unclear. One suggestion was to treat
these as (non-spacing) grave and acute, as appearing above upper- or
lowercase letters, but the graphical rendering in Brjabrin's original
article makes clear that the distinction is between acute and grave,
above or below the letter: this is what the table now has.
But the preponderance of graphical symbols in AV suggests that the
intention was to provide facilities for character graphics, in which
case the interpretation is simply straight lines connecting two
adjacent midpoints of the bounding box. If the box is the unit
square, these would run from (.5,0) to (0,.5) and to (1,.5), and from
(.5,1) to (0,.5) and to (1,.5), in this order. (The line segments are
of course directionless.) Such symbols are not present in Unicode --
the closest things are 0x25de 0x25df 0x25dc 0x25dd (in this order) but
these are curved, not straight.
Whether the graphics or the accent usage is more prevalent in actual
usage only those plugged into the Russian PC community can tell. If
the graphics usage turns out to be prevalent, these four symbols would
be reasonable candidates for incorporation into Unicode, perhaps at
positions 0x25ef to 0x25f3. */
/* From Osnovnoj Variant to Unicode */
long ovtou[128] = {
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2,
0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
0x0401,0x0451,0x0317,0x0316,0x0301,0x0300,0x2192,0x2190,
0x2193,0x2191,0x00f7,0x00b1,0x2116,0x00a4,0x25a0, -1
};
/* The same problem with the interpretation of 242-245 as in AV (these
rows are definitely identical). The low positions of OV are probably
identical to 176-223 in AV... */
/* From ISO8859-5 to Unicode */
long newkoi8tou[128] = {
-1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1,
0x00a0,0x0401,0x0402,0x0403,0x0404,0x0405,0x0406,0x0407,
0x0408,0x0409,0x040a,0x040b,0x040c,0x00ad,0x040e,0x040f,
0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
0x2116,0x0451,0x0452,0x0453,0x0454,0x0455,0x0456,0x0457,
0x0458,0x0459,0x045a,0x045b,0x045c,0x00a7,0x045e,0x045f
};
/* Use newkoi8tou in combination with isotoibm to derive the unicode
meaning of the Cyrillic range in the DKOI extension of EBCDIC. If
someone has DKOI-8 text available, I'd love to actually try... */
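The combination suggested in the comment above amounts to inverting isotoibm. Here is a minimal, untested-against-real-data sketch of that idea. The names iso5_to_unicode and build_dkoitou are hypothetical. To keep the fragment short, iso5_to_unicode stands in for the 160-255 half of the newkoi8tou table by exploiting the regular layout of ISO 8859-5, with the four non-Cyrillic positions handled explicitly. In the result, -1 only means "not reached from 160-255 of 8859-5", not "unused in DKOI-8".

```c
/* ISO 8859-5 (160-255) to DKOI-8, copied from the isotoibm table. */
static const int isotoibm[96] = {
    0x41,0xaa,0x4a,0xb1,0x9f,0xb2,0x6a,0xb5,
    0xbd,0xb4,0x9a,0x8a,0x5f,0xca,0xaf,0xbc,
    0x90,0x8f,0xea,0xfa,0xbe,0xa0,0xb6,0xb3,
    0x9d,0xda,0x9b,0x8b,0xb7,0xb8,0xb9,0xab,
    0x64,0x65,0x62,0x66,0x63,0x67,0x9e,0x68,
    0x74,0x71,0x72,0x73,0x78,0x75,0x76,0x77,
    0xac,0x69,0xed,0xee,0xeb,0xef,0xec,0xbf,
    0x80,0xfd,0xfe,0xfb,0xfc,0xad,0xae,0x59,
    0x44,0x45,0x42,0x46,0x43,0x47,0x9c,0x48,
    0x54,0x51,0x52,0x53,0x58,0x55,0x56,0x57,
    0x8c,0x49,0xcd,0xce,0xcb,0xcf,0xcc,0xe1,
    0x70,0xdd,0xde,0xdb,0xdc,0x8d,0x8e,0xdf
};

/* Stand-in for newkoi8tou on 160-255 (assumes the regular layout). */
static long iso5_to_unicode(int iso)
{
    switch (iso) {
    case 0xa0: return 0x00a0;               /* no-break space */
    case 0xad: return 0x00ad;               /* soft hyphen */
    case 0xf0: return 0x2116;               /* numero sign */
    case 0xfd: return 0x00a7;               /* section sign */
    default:   return 0x0400 + (iso - 0xa0);
    }
}

/* Fill dkoitou[0..255] with Unicode values for the DKOI-8 codes that
   isotoibm reaches; all other entries are set to -1. */
void build_dkoitou(long dkoitou[256])
{
    int i;

    for (i = 0; i < 256; i++)
        dkoitou[i] = -1;
    for (i = 0; i < 96; i++)
        dkoitou[isotoibm[i]] = iso5_to_unicode(160 + i);
}
```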